The objective of this second assignment is to explore the differentially expressed genes from the cleaned and normalized data in assignmnent one. Then rank the thresholded over-representation analysis to highlight the top terms / dominant themes in the top set of genes. Lastly, compare my result with the original literature and find some other supports as well for my result if possible. I make some changes to my assignment one and stored the file as “ammended_A1.Rmd” and imported it here. I will demonstrate briefly what I have changed and I general workflow in A1.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("limma", quietly = TRUE))
BiocManager::install("limma")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
if (!requireNamespace("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
if (!requireNamespace("EnsDb.Hsapiens.v75", quietly = TRUE))
BiocManager::install("EnsDb.Hsapiens.v75")
if (!requireNamespace("knitr", quietly = TRUE))
BiocManager::install("knitr")
if (!requireNamespace("gprofiler2", quietly = TRUE))
BiocManager::install("gprofiler2")
library(limma)
library(edgeR)
library(EnsDb.Hsapiens.v75)
library(knitr)
library(ComplexHeatmap)
library(circlize)
library(gprofiler2)
My data is from GSE84054
What I have changed in A1: I decided to leave the genes that have duplicated names because Ruth sugguested that they are a very small number compare to my total number of genes, ~1%. So I deleted many steps in A1, and made the cleaning step clearer by showing boxplots, density plots and MDS plots in each cleaning step.
Loaded my normalized data from A1.
normalized_count_data <- read.table(file="GSE84054_normalized_count.txt")
kable(normalized_count_data[1:5, 1:5], type="html")
| GENEID | SYMBOL | RHB037 | RHB038 | RHB041 | |
|---|---|---|---|---|---|
| ENSG00000000003 | ENSG00000000003 | TSPAN6 | 79.809196 | 63.375225 | 37.719858 |
| ENSG00000000419 | ENSG00000000419 | DPM1 | 36.461054 | 38.451386 | 49.614654 |
| ENSG00000000457 | ENSG00000000457 | SCYL3 | 12.060195 | 12.077110 | 6.329049 |
| ENSG00000000460 | ENSG00000000460 | C1orf112 | 6.762435 | 4.972927 | 3.848316 |
| ENSG00000000938 | ENSG00000000938 | FGR | 1.402348 | 3.966502 | 1.940060 |
# After normalization
data2plot_after <- log2(normalized_count_data[,3:ncol(normalized_count_data)])
{boxplot(data2plot_after, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Filtered and normalized RNASeq Samples")
abline(h = median(apply(data2plot_after, 2, median)), col = "green", lwd = 0.6, lty = "dashed")}
counts_density <- apply(log2(normalized_count_data[,3:ncol(normalized_count_data)]), 2, density)
#calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM",cex.lab = 0.85,
main = "Filtered and normalized RNASeq Samples distribution")
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(data2plot), col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4", merge = TRUE, bg = "gray90")
heatmap_matrix <- normalized_count_data[,3:ncol(normalized_count_data)]
rownames(heatmap_matrix) <- normalized_count_data$GENEID
colnames(heatmap_matrix) <- rownames(samples_filtered)
# MDS plot by "cell_type" in samples
plotMDS(heatmap_matrix, labels=rownames(samples_filtered),
col = c("darkgreen","blue")[factor(samples_filtered$cell_type)],
main = "MDS plot depending on cell type")
pat_colors <- rainbow(12)
pat_colors <- unlist(lapply(pat_colors,FUN=function(x){rep(x,2)}))
# MDS plot by "cell_type" + "patients"in samples
plotMDS(heatmap_matrix, col = pat_colors,
main = "MDS plot depending on both cell type and patients")
Based on the two models in part1, I decide to base my model only on “cell_type”
model_design <- model.matrix(~ samples$cell_type)
kable(model_design, type="html")
| (Intercept) | samples$cell_typeSphere |
|---|---|
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
There are 6988 genes that pass the p-value = 0.05 which is chosen based on the paper.
expressionMatrix <- as.matrix(normalized_count_data[,3:ncol(normalized_count_data)])
rownames(expressionMatrix) <- normalized_count_data$GENEID
colnames(expressionMatrix) <- colnames(normalized_count_data)[3:ncol(normalized_count_data)]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
# fit
fit <- lmFit(minimalSet, model_design)
# Use Bayes
fit2 <- eBayes(fit,trend=TRUE)
# Correction: BH
topfit <- topTable(fit2, coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(normalized_count_data[,1:2], topfit, by.y = 0, by.x = 1, all.y=TRUE)
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
kable(output_hits[1:5,],type="html")
| GENEID | SYMBOL | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|---|
| 16973 | ENSG00000235865 | GSN-AS1 | -4.9824524 | 2.8883230 | -10.370167 | 0 | 6.20e-06 | 5.229345 |
| 3787 | ENSG00000113575 | PPP2CA | 101.3281947 | 88.8886109 | 9.056963 | 0 | 4.02e-05 | 4.390996 |
| 19346 | ENSG00000267904 | CTC-429P9.5 | -0.8873553 | 0.6414619 | -8.675083 | 0 | 5.57e-05 | 4.108874 |
| 3158 | ENSG00000108091 | CCDC6 | 78.2685270 | 64.3139803 | 8.563287 | 0 | 5.57e-05 | 4.022672 |
| 6795 | ENSG00000138600 | SPPL2A | 22.4430356 | 22.1442964 | 8.151324 | 0 | 9.54e-05 | 3.690225 |
# number of genes that pass threshold p-value = 0.05
length(which(output_hits$P.Value < 0.05)) # 6988
# number of genes that pass correction
length(which(output_hits$adj.P.Val < 0.05)) # 4062
d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
d <- estimateDisp(d, model_design_pat)
I have shown that my data is suitable for using edgeR for further analysis. The data follows the binomial distribution.
plotMeanVar(d, show.raw.vars = TRUE,
show.tagwise.vars=TRUE,
show.ave.raw.vars = TRUE,
NBline=TRUE,
show.binned.common.disp.vars = TRUE,
main = "Binomial distribution of my data")
The individual dots represent each gene and the blue line is the overall trend line.
plotBCV(d,col.tagwise = "black",col.common = "red",
main = "BCV plot of RNA-seq data")
I used Quasi-likelihood models to fit my data and used QLFTest to test for differential expression. The Quasi-likelihood compares two conditions (primary tumour and tumoursphere) and shows the up and down-regulated genes. I have showed the result below that are sorted by p-value. I also inspected the number of genes that satisty my threshold and correction. I choose to use FDR correction based on the paper as well(Goh et al. 2017) . There are 7467 genes pass the p-value = 0.05, and 5033 genes that pass the FDR correction.
fit <- glmQLFit(d, model_design)
qlf.sphere_vs_tumour <- glmQLFTest(fit, coef='samples$cell_typeSphere')
kable(topTags(qlf.sphere_vs_tumour), type="html")
| logFC | logCPM | F | PValue | FDR | |
|---|---|---|---|---|---|
| ENSG00000133574 | -10.864552 | 4.3344790 | 211.3416 | 0 | 0 |
| ENSG00000106483 | -12.504407 | 5.1136480 | 184.1976 | 0 | 0 |
| ENSG00000179144 | -10.957441 | 4.7967197 | 159.9882 | 0 | 0 |
| ENSG00000163563 | -9.261757 | 3.3981949 | 156.9675 | 0 | 0 |
| ENSG00000130300 | -12.870252 | 5.0189032 | 153.4504 | 0 | 0 |
| ENSG00000211668 | -11.278848 | 3.4432853 | 146.3944 | 0 | 0 |
| ENSG00000147113 | -10.742034 | 3.3588694 | 141.1675 | 0 | 0 |
| ENSG00000167286 | -10.447530 | 2.6237953 | 140.2040 | 0 | 0 |
| ENSG00000120279 | -9.829420 | 2.4408378 | 138.9715 | 0 | 0 |
| ENSG00000106952 | -7.721763 | -0.0200685 | 138.2581 | 0 | 0 |
| x |
|---|
| BH |
| x |
|---|
| samples$cell_typeSphere |
| x |
|---|
| glm |
# Get all the results
qlf_output_hits <- topTags(qlf.sphere_vs_tumour,
sort.by = "PValue",
n = nrow(normalized_count_data))
# Number of genes that pass the threshold p-value = 0.05
length(which(qlf_output_hits$table$PValue < 0.05)) # 7467
# Number of genes that pass correction
length(which(qlf_output_hits$table$FDR < 0.05)) # 5033
I determined the number of up-regulated genes by selecting every gene that does not pass my p-value: 0.05, and also have a positive log fold change. Down-regulated genes are selected in the same way with a negative log fold change. Stored these data for later enrichment analysis.
# How many genes are up regulated?
length(which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 0)) # 1897
# How many genes are down regulated?
length(which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC < 0)) # 5570
# Get those up and down-regulated genes
qlf_output_hits_withgn <- merge(expr_filtered[,1:2],qlf_output_hits, by.x=1, by.y = 0)
upregulated_genes <- qlf_output_hits_withgn$GENEID[which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 0)]
downregulated_genes <-qlf_output_hits_withgn$GENEID[which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC < 0)]
# store data
unreg_genes_copy <- data.frame(upregulated_genes)
downreg_genes_copy <- data.frame(downregulated_genes)
names(unreg_genes_copy) <- names(downreg_genes_copy)
all_de <- rbind(unreg_genes_copy, downreg_genes_copy)
colnames(all_de) <- "all_de"
write.table(x=all_de,
file="all_expr_de_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=upregulated_genes,
file="expr_upregulated_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
file="expr_downregulated_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
I have shown the up and down-regulated genes in a volcano plot by coloring them in red and blue, the code is from (Annick Moisan, n.d.)
volcanoData <- cbind(qlf_output_hits$table$logFC, -log10(qlf_output_hits$table$FDR))
colnames(volcanoData) <- c("logFC", "Pval")
up <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC > 0
point.col <- ifelse(up, "red", "black")
plot(volcanoData, pch = 16, col = point.col, cex = 0.5,
main = "Up-regulated genes in RNA-seq data")
down <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC < 0
point.col <- ifelse(down, "blue", "black")
plot(volcanoData, pch = 16, col = point.col, cex = 0.5,
main = "Down-regulated genes in RNA-seq data")
To test the differential expression, I used the heatmap and it has shown a clear distinction between up and down regulated genes. There is a clear difference between the primary tumour samples and tumoursphere samples.(They are reversed.) The clustering is very obvious to note that differential expression exists. s
top_hits <- rownames(qlf_output_hits$table)[output_hits$P.Value<0.05]
heatmap_matrix_tophits <- t(scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
heatmap_matrix_tophits <- heatmap_matrix_tophits[, c(grep(colnames(heatmap_matrix_tophits),pattern = "_P"),
grep(colnames(heatmap_matrix_tophits),pattern = "_S"))]
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = FALSE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,)
current_heatmap
There were no REAC found. I used the gprofiler2’s function to query data and also attached the screenshots that I took on their website since the package does not show the number of gene sets each has found.
The some analysis is apply to down regulated
upregulated_genes_sym <- qlf_output_hits_withgn$SYMBOL[which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 1)]
length(upregulated_genes_sym) # 1312
upregulated_genes_sym[grep(pattern="ALDH",upregulated_genes_sym)]
# ALDH2, ALDH8A1, ALDH1L2
Annick Moisan, Nathalie Villa-Vialaneix, Ignacio Gonzales. n.d. “Practical Statistical Analysis of Rna-Seq Data - edgeR - Tomato Data.”
Goh, Jian Yuan, Min Feng, Wenyu Wang, Gokce Oguz, Siti Maryam JM Yatim, Puay Leng Lee, Yi Bao, et al. 2017. “Chromosome 1q21. 3 Amplification Is a Trackable Biomarker and Actionable Target for Breast Cancer Recurrence.” Nature Medicine 23 (11). Nature Publishing Group: 1319.
Vasiliou, Vasilis, and Daniel W Nebert. 2005. “Analysis and Update of the Human Aldehyde Dehydrogenase (Aldh) Gene Family.” Human Genomics 2 (2). BioMed Central: 138.
Vassalli, Giuseppe. 2019. “Aldehyde Dehydrogenases: Not Just Markers, but Functional Regulators of Stem Cells.” Stem Cells International 2019. Hindawi.